-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dask] allow tight control over ports #3994
Conversation
I tested this on a I tried all three possible configurations described above, and all three trainings succeeded. So I'm confident this will work on either |
python-package/lightgbm/dask.py
Outdated
* ``local_listen_port``: port that each LightGBM worker opens a listening socket on, | ||
to accept connections from other workers. This can be differ from LightGBM worker | ||
to LightGBM worker, but does not have to. | ||
* ``machines``: a list of all machines in the cluster, plus a port to communicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This totally just my opinion, but "machines": a list of all machines in the cluster, plus a port to communicate..."
is kind of circular - the second use of "machines" is referring to IP addresses, but the original use of machines
(IMO) is an IP:port combo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah I probably should not use the word "machines" again in the description haha, thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok updated in 25462ea, thanks for catching this
Hey I think these three options make sense and that they're easy to understand! Nice work! Really, I don't have much to add. |
@jameslamb I haven't looked at the diff yet, just read the detailed explanation you've provided. I think I found one inconsistency.
Default value of the |
ok sure, no problem. Done in 0c81f60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! But I have some comments below.
"machine in the cluster has multiple Dask worker processes running on it. Please omit " | ||
"'local_listen_port' or pass 'machines'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"machine in the cluster has multiple Dask worker processes running on it. Please omit " | |
"'local_listen_port' or pass 'machines'." | |
"machine in the cluster has multiple Dask worker processes running on it.\nPlease omit " | |
"'local_listen_port' or pass full configuration via 'machines' parameter." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it is up to you. Feel free to revert new line. I personally don't like long line warnings/errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright I'm not going to accept this suggestion then if it's just a matter a matter of personal preference.
I've had problems in the past with external logs-management systems and log messages that have newline characters. You can read about that general problem at https://www.datadoghq.com/blog/multiline-logging-guide/#the-multi-line-logging-problem if you're interested.
Long log messages will also be wrapped automatically in Jupyter notebooks
and in python
REPLs
python-package/lightgbm/dask.py
Outdated
# * 'num_machines': set automatically from Dask worker list | ||
# * 'num_threads': overridden to match nthreads on each Dask process | ||
for param_alias in _ConfigAliases.get('machines', 'num_machines', 'num_threads'): | ||
for param_alias in _ConfigAliases.get('num_machines', 'num_threads'): | ||
params.pop(param_alias, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is important to notify users about this behavior.
params.pop(param_alias, None) | |
if param_alias in params: | |
_log_warning(f"Parameter {param_alias} will be ignored.") | |
params.pop(param_alias) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I accepted this suggestion but I think now that we should only apply it to num_machines
, not num_threads
.
This results in a warning that users cannot suppress.
/opt/conda/lib/python3.8/site-packages/lightgbm/dask.py:338: UserWarning: Parameter n_jobs will be ignored.
_log_warning(f"Parameter {param_alias} will be ignored.")
Caused by the fact that n_jobs
is an alias of num_threads
LightGBM/python-package/lightgbm/dask.py
Line 489 in 646267d
params = self.get_params(True) LightGBM/python-package/lightgbm/sklearn.py
Line 516 in 646267d
params = super().get_params(deep=deep)
I believe that every warning should be something that can be changed by user code changes. Otherwise, we're just adding noise to logs that might cause people to start filtering out ALL warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to notify users about num_threads
as well before implementing #3714. Silently ignore parameter is more serious problem compared to unfixable warning, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree in this specific case about the meaning of "ignore", since this is a parameter default and not something explicitly passed in. However, since num_threads
isn't directly related to the purpose of this PR and since I don't want to delay this PR too long because I'd like to merge #3823 soon after it, I'll leave this warning in for now and propose another PR in a few days where we could discuss it further.
dask_model2.fit(dX, dy, group=dg) | ||
else: | ||
dask_model2.fit(dX, dy) | ||
assert dask_model2.fitted_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really assert that
assert dask_model2.get_params()['machines'] == machines
- check somehow that passed workers from
machines
were used and not any other as it is stated in the test name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh good point! ok yes, I can do that.
check somehow that passed workers from machines were used and not any other as it is stated in the test name.
I can test this by binding one of the ports mentioned in machines
, and asserting that training fails with the "failed to bind port" error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I think I've addressed both oof these in my most recent commit. I'm glad you mentioned #1
, because it's something I didn't think about correctly. To fix that, I captured whether machines
was provided explicitly in parameters, and if it was I had _train()
preserve it in the fitted model object. Like this
if not machines_in_params:
for param in _ConfigAliases.get('machines'):
model._other_params.pop(param, None)
Co-authored-by: Nikita Titov <[email protected]>
Sorry for being late to the party, I've had a lot of stuff at work.
I agree with @StrikerRUS here and I really like the proposal of including the def _check_port_is_open(port: int) -> int:
"""Check that port is open on the machine or find a random open one if port is 0."""
try:
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(('', port))
port = s.getsockname()[1]
except OSError:
# handle port already in use
return port to |
No problem @jmoralez , thanks for all your help so far with the Dask package.
I'm against adding additional code for this in the Python package. You'll already get the informative error "Binding port 12345 failed" if one of the ports you chose is not open. LightGBM/src/network/linkers_socket.cpp Line 131 in 5005b7b
I don't want to encourage or support this pattern. If the communication between workers is not governed by strict firewall rules, The "list IPs explicitly with So I'm against adding a 4th option that is like "if you use |
|
||
@pytest.mark.parametrize('task', tasks) | ||
@pytest.mark.parametrize('output', data_output) | ||
def test_machines_should_be_used_if_provided(task, output): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to make this a separate test from test_network_params_not_required_but_respected_if_given
because all the different context managers (client
fixture + with pytests.raises()
+ with socket.socket()
) were interacting in a strange way, and I was getting frequent teardown errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also why I had to use LocalCluster
instead of the client
fixture for this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
with LocalCluster(n_workers=2) as cluster: | ||
with Client(cluster) as client: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just letting you know that you can use one-line with
for multiple variables to not make the code over-indented:
with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooo I didn't know you could do that when the second with
references the as
name from the first one, nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made that change in b3c8a2c, thanks
Co-authored-by: Nikita Titov <[email protected]>
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
In #3823, @jmoralez proposed a change for the Dask module that would speed up training by taking a bit of setup overhead that is
O(n_workers)
and make itO(1)
. That discussion led to a conversation about how to give tight control over the ports used by LightGBM, for users who are using a cluster of machines whose communication is limited by firewall rules.I promised to go do some research in LightGBM's source code and come back with a proposal. This PR is that proposal.
Proposed Changes
This PR proposes that
lightgbm.dask
should handle ports as follows.If no network params are given in
params
(the default)LightGBM chooses ports randomly. As of this PR, it will use the current approach on
master
(searching 1000 ports starting with the default value oflocal_listen_port
), but after this PR we should replace that with @jmoralez 's proposal in #3823.If
machines
is provided inparams
LightGBM respects
machines
and does not do any searching. This gives people who are in a constrained environment total control over the ports used, without needing to invent a new parameter specific to the Dask interface.This can work for cases where you run multiple Dask worker processes on the same host. For example, for a
FargateCluster
withn_workers=4, nprocs=2
, you might usemachines
like this:In a follow-up PR after this, to make this easier, I'd propose adding a function
dask_collection_to_machines_param()
, which takes in your training data and a list of ports that you've allowed in your firewall settings, and returns amachines
string you can put intoparams
.If
local_listen_port
is provided inparams
Create
machines
by looking at which worker addresses have a piece of the training data, and assume that each of them will use the same port (local_listen_port
). This will only work in the case where you have 1 Dask worker process per physical host.How this PR improves
lightgbm.dask
This gives an official answer to the question "how I should I used distributed LightGBM training with Dask if my Dask cluster is constrained by firewall rules?", while also making it possible to take advantage of the speedups from #3823 .
It does this in a way that only uses existing LightGBM network parameters, and which doesn't require any notes in documentation that say "hey this parameter has a slightly different meaning in Dask than everywhere else".
How LightGBM's network setup works
This is optional background that will help explain the proposal I'm making. Everything below refers only to the socket-based build of LightGBM.
notes on how the code in src/network works
Each worker runs
Network::Init()
initialize a LightGBM network. The following happens when that method is called.Linkers
machines
list" -->LightGBM/src/network/linkers_socket.cpp
Line 42 in 75b9b0d
LightGBM/src/network/linkers_socket.cpp
Lines 51 to 52 in 75b9b0d
LightGBM/src/network/linkers_socket.cpp
Lines 54 to 60 in 75b9b0d
Linkers::Construct()
to set up these connectionsrank
greater than mine. I know frommachines
which IP addresses to look for and whatlocal_listen_port
they'll be listening on. The workers withrank
less than me will initiate communications with me. Each time I communicate successfully, I'll save that new TCP socket so I can use it during training to talk to that specific worker" -->LightGBM/src/network/linkers_socket.cpp
Lines 196 to 216 in 75b9b0d
LightGBM/src/network/linkers_socket.cpp
Lines 65 to 66 in 75b9b0d
LightGBM/src/network/network.cpp
Lines 37 to 40 in 75b9b0d
You can learn more about this from https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#preparation, https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#id1, and https://lightgbm.readthedocs.io/en/latest/Parameters.html#network-parameters.
Notes for Reviewers
I'll devote a section to this topic in the documentation introduced for #3814 .
The one Dask unit test that's failing will pass if #3993 is accepted and merged.